On cherche à étudier l’effet de trois facteurs sur le transcriptome des racines d’Arabidopsis. On cherche ici à utiliser et comparer différents packages de multiDE sur nos données.
data <- read.csv("quantifFiles/quantifGenesTomate.csv", h = T, sep = ",")
rownames(data) <- data$Gene
genes = which(!(grepl("__", rownames(data))))
not_quant = data[which((grepl("__", rownames(data)))), ]
data = data[genes, grepl("R", colnames(data))]
head(data) R48 R39 R38 R28 R29 R30 R25 R31 R27 R33 R32 R26 R36 R37 R35 R34
Sly00g0382731 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Sly00g0382741 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Sly00g0382751 400 320 259 280 464 370 228 321 282 348 261 302 309 306 264 238
R47 R46 R44 R45 R41 R40 R42 R43
Sly00g0382731 2 0 0 0 0 0 0 0
Sly00g0382741 0 0 0 0 0 0 0 0
Sly00g0382751 312 269 300 223 552 342 401 216
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
[1] 39110 24
annot <- read.csv("Code_for_RNAseq_CO2_N_Fr.csv", h = T, sep = ";")
conditions <- as.vector(unique(annot$Sample))
annot$ID <- paste0("R", annot$Code)
annot$condition <- substr(conditions, 1, nchar(conditions) - 1)
cond <- unique(substr(conditions, 1, nchar(conditions) - 1))
cond <- cond[grepl("Sl", cond)]
getCondition <- function(id) {
return(annot[annot$ID == id, "condition"])
}
getExactCondition <- function(id) {
return(annot[annot$ID == id, "Sample"])
}
getLabel <- function(id) {
text <- as.vector(annot[annot$ID == id, "Sample"])
res <- ""
nb <- substr(text, nchar(text), nchar(text))
if (grepl("Ambient", text)) {
res = paste0(res, "c")
} else {
res = paste0(res, "C")
}
if (grepl("High", text)) {
res = paste0(res, "N")
} else {
res = paste0(res, "n")
}
if (grepl("Starv", text)) {
res = paste0(res, "f")
} else {
res = paste0(res, "F")
}
res = paste0(res, "_", nb)
return(res)
}
getLabel("R36")[1] "CnF_3"
[1] Sl_AmbientCO2_LowNitrate_FeStarvation1
48 Levels: At_AmbientCO2_HighNitrate_Fe1 ... Sl_ElevatedCO2_LowNitrate_FeStarvation3
[1] "At_AmbientCO2_LowNitrate_Fe"
TCC fait le test globalement, avec un layout “one way”, qui prend les 8 conditions comme toutes différentes. En spécifiant un design de combinatoire \(2*2*2\), on peut avoir des coefficients testés contre 0 pour chacune des variables et leur interactions.
glmLRT conducts likelihood ratio tests for one or more coefficients in the linear model. If coef is used, the null hypothesis is that all the coefficients indicated by coef are equal to zero. If contrast is non-null, then the null hypothesis is that the specified contrasts of the coefficients are equal to zero. For example, a contrast of c(0,1,-1), assuming there are three coefficients, would test the hypothesis that the second and third coefficients are equal.
While the likelihood ratio test is a more obvious choice for inferences with GLMs, the QLF-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene. Itprovides more robust and reliable error rate control when the number of replicates is small.The QL dispersion estimation and hypothesis testing can be done by using the functionsglmQLFit()andglmQLFTest()
# data
d <- data
# fixation correcte des contrastes et formule
colnames(d) <- sapply(colnames(d), getLabel)
groups <- str_split_fixed(colnames(d), "_", 2)[, 1]
# colnames(d) <- sapply(colnames(d), getLabel)
y <- DGEList(counts = d, group = groups)
y$samples group lib.size norm.factors
Cnf_3 Cnf 26741173 1
cNf_3 cNf 21291368 1
cNf_2 cNf 23788658 1
cnF_1 cnF 22016602 1
cnF_2 cnF 29965632 1
cnF_3 cnF 31788665 1
cNF_1 cNF 20074554 1
CNF_1 CNF 27503057 1
cNF_3 cNF 23782396 1
CNF_3 CNF 29424187 1
CNF_2 CNF 26687502 1
cNF_2 cNF 23004843 1
CnF_3 CnF 30395731 1
cNf_1 cNf 25108644 1
CnF_2 CnF 20428037 1
CnF_1 CnF 21068589 1
Cnf_2 Cnf 24932215 1
Cnf_1 Cnf 28581275 1
CNf_2 CNf 25763959 1
CNf_3 CNf 23693250 1
cnf_2 cnf 35031098 1
cnf_1 cnf 25436178 1
cnf_3 cnf 27074819 1
CNf_1 CNf 24584667 1
# filtre
keep <- filterByExpr(y)
y <- y[keep, keep.lib.sizes = FALSE]
# normalisation
y <- calcNormFactors(y)
# norm <- cpm(y, normalized.lib.sizes=TRUE) not_norm <- cpm(y,
# normalized.lib.sizes=FALSE) rd_genes = sample(rownames(norm), 500)
# heatmap(not_norm[rd_genes,], main = 'Sans normalisation')
# heatmap(norm[rd_genes,], main = 'Avec normalisation')
# Definition du design et des bonnes conditions de référence
# design <- model.matrix(~0 + groups) con <- makeContrasts((groupscnF -
# groupscnf), levels=design) fit <- glmQLFit(y, contrast = con) qlf.2vs1 <-
# glmQLFTest(fit) a = topTags(qlf.2vs1, n= 1000) a$table
# estimation of dispersion and tests
y <- edgeR::estimateDisp(y)Design matrix not provided. Switch to the classic mode.
pval = 0.01
et <- exactTest(y)
best = topTags(et, n = 20000)
best = best[best$table$FDR < pval, ]
best = best[abs(best$table$logFC) > 1, ]
# genes <- rownames(cpm(y, normalized.lib.sizes=TRUE))
DEgenes = rownames(best$table)
plotBCV(y)On utilise des modèles de mélange pour regrouper les gènes ayant des profils d’expression similaires dans les différentes conditions.
Sly02g0349581 Sly05g0137921 Sly04g0251231 Sly06g0375851 Sly08g0273531
66563 13206 57922 56933 6862
Sly08g0282161 Sly10g0190401 Sly12g0071731 Sly11g0288951 Sly05g0155731
13240 60057 63640 72422 23644
Sly04g0251211 Sly12g0056631 Sly03g0209731 Sly01g0030961 Sly03g0228011
36456 65188 63184 106750 15113
Sly06g0370891 Sly02g0341281 Sly05g0142161 Sly10g0184191 Sly02g0324461
43186 27735 28696 9770 8643
Sly05g0157311 Sly06g0371401 Sly01g0000401 Sly03g0209751 Sly08g0262341
354599 62243 6741 50002 15671
Sly06g0352901 Sly07g0126971 Sly06g0353681 Sly12g0053661 Sly03g0222691
7532 109995 34317 60176 12172
Sly08g0263181 Sly04g0251201 Sly12g0070721 Sly05g0156491 Sly01g0035861
28078 77220 179473 20904 123637
Sly02g0349231 Sly03g0223341 Sly01g0022341 Sly12g0067311 Sly04g0247491
15702 90410 48343 9501 161474
Sly09g0095061 Sly11g0294281 Sly05g0145111 Sly01g0039461 Sly09g0099711
6223 18517 18702 26841 36932
Sly05g0143011 Sly06g0370181 Sly01g0003001 Sly05g0138461 Sly11g0307721
20587 177001 27616 36703 7975
Sly10g0183051 Sly03g0209741 Sly08g0283221 Sly06g0372181 Sly06g0355821
17552 67718 6226 9483 411
Sly06g0366131 Sly07g0119791 Sly02g0328831 Sly10g0184161 Sly09g0084621
89215 18801 83575 6174 70287
Sly08g0264571 Sly00g0386741 Sly02g0342271 Sly06g0364641 Sly06g0377981
22218 56904 20350 5730 3516
Sly11g0295831 Sly02g0346201 Sly06g0367601 Sly07g0110001 Sly01g0021751
231839 108176 65653 192025 17463
Sly05g0148761 Sly06g0371951 Sly06g0363891 Sly01g0005461 Sly08g0264231
159760 49045 19392 9767 1604
[ reached getOption("max.print") -- omitted 748 entries ]
conds <- str_split_fixed(colnames(dataC), "_", 2)[, 1]
run_pois <- coseq(dataC, conds = conds, K = 8:12, model = "Poisson", iter = 5, transformation = "none")****************************************
coseq analysis: Poisson approach & none transformation
K = 8 to 12
Use set.seed() prior to running coseq for reproducible results.
****************************************
Running g = 8 ...
[1] "Initialization: 1"
[1] "Log-like diff: 200.733285161674"
[1] "Log-like diff: 34.0168501515287"
[1] "Log-like diff: 4.67799934004114"
[1] "Log-like diff: 4.18226338077555"
[1] "Log-like diff: 0.014664002213415"
Running g = 9 ...
[1] "Initialization: 1"
[1] "Log-like diff: 0.00018384684975814"
[1] "Log-like diff: 2.05128579917613e-05"
[1] "Log-like diff: 2.28306505434261e-06"
Running g = 10 ...
[1] "Initialization: 1"
[1] "Log-like diff: 1.23397612128429e-08"
Running g = 11 ...
[1] "Initialization: 1"
[1] "Log-like diff: 3.46912272591684"
[1] "Log-like diff: 0.895422499903454"
[1] "Log-like diff: 0.137634506809182"
[1] "Log-like diff: 0.0176713797276591"
[1] "Log-like diff: 0.00353622179219926"
Running g = 12 ...
[1] "Initialization: 1"
[1] "Log-like diff: 0.120321852380055"
[1] "Log-like diff: 0.0241292474933132"
[1] "Log-like diff: 0.00128440542239971"
[1] "Log-like diff: 0.000177024889818966"
[1] "Log-like diff: 2.41107373533112e-05"
$logLike
$ICL
$profiles
$boxplots
$probapost_boxplots
$probapost_barplots
$probapost_histogram
*************************************************
Model: Poisson
Transformation: none
*************************************************
Clusters fit: 8,9,10,11,12
Clusters with errors: ---
Selected number of clusters via ICL: 12
ICL of selected model: -1703180
*************************************************
Number of clusters = 12
ICL = -1703180
*************************************************
Cluster sizes:
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7
34 90 125 80 85 94 83
Cluster 8 Cluster 9 Cluster 10 Cluster 11 Cluster 12
29 14 33 59 97
Number of observations with MAP > 0.90 (% of total):
816 (99.15%)
Number of observations with MAP > 0.90 per cluster (% of total per cluster):
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7
34 90 123 79 85 94 82
(100%) (100%) (98.4%) (98.75%) (100%) (100%) (98.8%)
Cluster 8 Cluster 9 Cluster 10 Cluster 11 Cluster 12
28 14 33 58 96
(96.55%) (100%) (100%) (98.31%) (98.97%)
On représente le clustering dans la plan principal d’une ACP
suppressMessages(library(ade4, warn.conflicts = F, quietly = T))
suppressMessages(library(adegraphics, warn.conflicts = F, quietly = T))
# fixation correcte des contrastes et formule
acp <- dudi.pca(log(dataC + 0.1), center = TRUE, scale = TRUE, scannf = FALSE, nf = 4)
summary(acp)Class: pca dudi
Call: dudi.pca(df = log(dataC + 0.1), center = TRUE, scale = TRUE,
scannf = FALSE, nf = 4)
Total inertia: 24
Eigenvalues:
Ax1 Ax2 Ax3 Ax4 Ax5
19.2464 1.5504 1.3660 0.6594 0.2463
Projected inertia (%):
Ax1 Ax2 Ax3 Ax4 Ax5
80.193 6.460 5.692 2.747 1.026
Cumulative projected inertia (%):
Ax1 Ax1:2 Ax1:3 Ax1:4 Ax1:5
80.19 86.65 92.34 95.09 96.12
(Only 5 dimensions (out of 24) are shown)
library("RColorBrewer")
dataC$cluster = clusters_per_genes[as.vector(rownames(dataC))]
s.corcircle(acp$co, xax = 1, yax = 2, fullcircle = FALSE, pback.col = "lightgrey")adegraphics::s.class(acp$li, xax = 1, yax = 2, as.factor(dataC$cluster), labels = as.character(levels(as.factor(dataC$cluster))),
col = brewer.pal(n = 10, name = "Paired"), chullSize = 1, ellipseSize = 0, plabels.cex = 0.7,
pbackground.col = "grey85", main = "Clusters dans le plan principal", ylim = c(-9,
9))adegraphics::s.class(acp$li, xax = 2, yax = 3, as.factor(dataC$cluster), labels = as.character(levels(as.factor(dataC$cluster))),
col = brewer.pal(n = 10, name = "Paired"), chullSize = 1, ellipseSize = 0, plabels.cex = 0.7,
pbackground.col = "grey85", main = "Clusters dans le plan principal", ylim = c(-9,
9))adegraphics::s.class(acp$li, xax = 4, yax = 2, as.factor(dataC$cluster), labels = as.character(levels(as.factor(dataC$cluster))),
col = brewer.pal(n = 10, name = "Paired"), chullSize = 1, ellipseSize = 0, plabels.cex = 0.7,
pbackground.col = "grey85", main = "Clusters dans le plan principal", ylim = c(-9,
9))adegraphics::s.class(acp$li, xax = 4, yax = 3, as.factor(dataC$cluster), labels = as.character(levels(as.factor(dataC$cluster))),
col = brewer.pal(n = 10, name = "Paired"), chullSize = 1, ellipseSize = 0, plabels.cex = 0.7,
pbackground.col = "grey85", main = "Clusters dans le plan principal", ylim = c(-9,
9)) Il semblerait que l’ACP détecte dans un premier temps (avec le premier vecteur principal) la valeur moyenne d’expression, puis l’expression diff induite par le fer.
On peut faire des représentations dans le plan des seconds et troisième axes principaux (le premier traduit le fer, le second la carence en nitrate).
Sur le cercle de corrélations dans le plan 2 et 3, on voit bien que lorsque la carence fer est là, on peut différencier un effet nitrate. Le CO2 est, lui encore difficile à identifier, le 4eme axe de l’ACP ne renseignant pas beaucoup…